Emotive Qualities in Robot Speech

Cynthia  Breazeal 


Abstract — 

This paper explores the expression of emotion in synthesized
speech for an anthropomorphic robot. We have adapted several
key emotional correlates of human speech to the robot’s speech
synthesizer to allow the robot to speak in either an angry,
calm, disgusted, fearful, happy, sad, or surprised manner. We
have evaluated our approach through acoustic analysis of the
speech patterns for each vocal affect and have studied how
well human subjects perceive the intended affect.

Keywords — Human-robot  interaction,  emotive  expression, 
synthesized  speech. 

I.  Introduction 

There is growing research and commercial interest in building
robots that can interact with people in a life-like and social
manner. For robotic applications where the robot and human
establish and maintain a long-term relationship, such as
robotic pets for children or robotic nursemaids for the
elderly, communication of affect is important. There have been
a number of projects exploring models of emotion for robots or
animated life-like characters [1], [2], [3], [4], [5], the
recognition of emotive states in people [6], [7], [8], [9],
and the expression of affect in facial expression [10], [11],
[12] and body movement [13]. This paper explores the
expression of emotion in synthesized speech for an
anthropomorphic robot (called Kismet) with a highly expressive
face. We have adapted several key emotional correlates of
human speech to the robot’s synthesizer (based on DECtalk
v4.0) to allow Kismet to speak in either an angry, calm,
disgusted, fearful, happy, sad, or surprised manner. We have
evaluated our approach through acoustic analysis of the speech
patterns for each vocal affect. We have also studied how well
human subjects perceive the intended affect.

It is well accepted that facial expressions (related to
affect) and facial displays (which serve a communication
function) are important for verbal communication. Hence,
Kismet’s vocalizations should convey the affective state of
the robot. This provides a person with important affective
information as to how to appropriately engage a sociable
robot like Kismet. If done properly, Kismet could then use
its emotive vocalizations to convey disapproval, frustration,
disappointment, attentiveness, or playfulness. This fosters
richer and sustained social interaction, and helps to maintain
the person’s interest. For a compelling verbal exchange, it is
also important for Kismet to accompany its expressive speech
with appropriate motor movements of the lips, jaw, and face.
The ability to lip synchronize with expressive speech
strengthens the perception of Kismet as a social creature that
expresses itself vocally and through facial expression. A
disembodied voice would detract from the life-like quality of
interaction that we would like Kismet to have with people.
Synchronized movements of the face with voice both complement
and supplement the information transmitted through the verbal
channel. In earlier work we presented Kismet’s emotion system
and its expressive facial animation system (which includes
emotive facial expressions and lip synchronization) [12]. This
paper presents our work in giving Kismet’s voice emotive
qualities.

II.  Emotion  in  Speech 

There has been an increasing amount of work in identifying
those acoustic features that vary with a speaker’s affective
state [14]. Figure 1 summarizes the effects of emotion in
human speech that tend to alter the pitch, timing, voice
quality, and articulation of the speech signal [15]. Several
of these features, however, are also modulated by the prosodic
effects that the speaker uses to communicate grammatical
structure and lexical correlates. These tend to have a more
localized influence on the speech signal, such as emphasizing
a particular word. For recognition tasks, this increases the
challenge of isolating those feature characteristics modulated
by emotion. Even humans are not perfect at perceiving the
intended emotion for those emotional states that have similar
acoustic characteristics. For instance, surprise can be
perceived or understood as either joyous surprise (happiness)
or apprehensive surprise (fear). Disgust is a form of
disapproval and can be confused with anger. Picard (1997) [6]
presents an overview of work in this area.


Fig. 1. Typical effect of emotions on adult human speech, adapted
from Murray and Arnott (1993) and Picard (1997).

A few systems have been developed to synthesize emotional
speech. For instance, Jun Sato (see
www.ee.seikei.ac.jp/user/junsato/research/) trained a neural
network to modulate a neutrally spoken speech signal (in
Japanese) to convey one of four emotional states (happiness,
anger, sorrow, disgust). The neural network was trained on
speech spoken by Japanese actors. This approach has the
advantage that the output speech signal sounds more natural
than purely synthesized speech. For our interactive robot
application, this approach has the disadvantage that the
speech input to the system must be prerecorded. Kismet must be
able to generate its own utterances to suit the circumstance.

The Affect Editor by Janet Cahn is among the earliest work in
expressive synthesized speech [15]. Her system was based on
DECtalk3, a commercially available text-to-speech synthesizer.
Given an English sentence and an emotional quality (one of
anger, disgust, fear, joy, sorrow, or surprise), she developed
a methodology for mapping the emotional correlates of speech
(changes in pitch, timing, voice quality, and articulation)
onto the underlying DECtalk synthesizer settings. She took
great care to introduce the global prosodic effects of emotion
while still preserving the more local influences of
grammatical and lexical correlates of speech intonation. With
respect to giving Kismet the ability to generate emotive
vocalizations, Cahn’s work is a valuable resource that we have
adapted and extended to suit our purposes.


Fig.  2.  Kismet’s  expressive  speech  GUI.  Listed  is  a  selection  of 
emotive  qualities,  the  vocal  affect  parameters,  and  the  synthesizer 
settings. 

III.  The  Expressive  Voice  Synthesis  System 

Emotions have a global impact on speech since they modulate
the respiratory system, larynx, vocal tract, muscular system,
heart rate, and blood pressure. There is an assortment of
vocal affect parameters (VAPs) that alter the pitch, timing,
voice quality, and articulation aspects of the speech signal.
The pitch-related parameters affect the pitch contour of the
speech signal, which is the primary contributor of affective
information. The pitch-related parameters include accent
shape, average pitch, pitch contour slope, final lowering,
pitch range, and pitch reference line. The timing-related
parameters modify the prosody of the vocalization, often being
reflected in speech rate and stress placement. The
timing-related parameters include speech rate, pauses,
exaggeration, and stress frequency. The voice-quality
parameters include loudness, brilliance, breathiness,
laryngealization, pitch discontinuity, and pause
discontinuity. The articulation parameter modifies the
precision of what is uttered, making it either more enunciated
or more slurred. These vocal affect parameters are described
in more detail below.

Our task is to derive a mapping of these physiological vocal
affect parameters to the underlying synthesizer settings (we
use DECtalk v4.0) to convey the emotional qualities of anger,
fear, disgust, happiness, sadness, and surprise in Kismet’s
voice. There is currently a single fixed mapping per emotional
quality. Figure 3, along with the equations presented in this
paper, summarizes how the vocal affect parameters are mapped
to the DECtalk synthesizer settings. The default values and
max/min bounds for these settings are given in Figure 4.
Figure 5 summarizes how each emotional quality of voice is
mapped onto the VAPs.

Fig. 3. Percent contributions of vocal affect parameters to DECtalk
synthesizer settings. The absolute values of the contributions in the
far right column add up to 1 (100%) for each synthesizer setting. See
the equations in Section III-B for the mapping.

A.  The  Vocal  Affect  Parameters  (VAPs) 

The  following  six  pitch  parameters  influence  the  pitch 
contour  of  the  spoken  utterance.  The  pitch  contour  is  the 
trajectory  of  the  fundamental  frequency,  f0,  over  time. 

• Accent Shape: Modifies the shape of the pitch contour for
any pitch-accented word by varying the rate of f0 change
about that word. A high accent shape corresponds to speaker
agitation, where there is a high peak f0 and a steep rising
and falling pitch contour slope. This parameter has a
substantial contribution to DECtalk’s stress rise setting,
which regulates the f0 magnitude of pitch-accented words.

• Average Pitch: Quantifies how high or low the speaker
appears to be speaking relative to their normal speech. It is
the average f0 value of the pitch contour. It varies directly
with DECtalk’s average pitch.

•  Contour  Slope:  Describes  the  general  direction  of  the 
pitch  contour,  which  can  be  characterized  as  rising,  falling, 
or  level.  It  contributes  to  two  DECtalk  settings.  It  has 
a  small  contribution  to  the  assertiveness  setting,  and 
varies  inversely  with  the  baseline  fall  setting. 

• Final Lowering: Refers to the amount that the pitch contour
falls at the end of an utterance. In general, an utterance
will sound emphatic with a strong final lowering, and
tentative if weak. It can also be used as an auditory cue to
regulate turn taking. A strong final lowering can signify the
end of a speaking turn, whereas a speaker’s intention to
continue talking can be conveyed with a slight rise at the
end. This parameter strongly contributes to DECtalk’s
assertiveness setting and somewhat to the baseline fall
setting.

• Pitch Range: Measures the bandwidth between the maximum and
minimum f0 of the utterance. The pitch range expands and
contracts about the average f0 of the pitch contour. It varies
directly with DECtalk’s pitch range setting.

• Reference Line: Controls the reference pitch f0 contour.
Pitch accents cause the pitch trajectory to rise above or dip
below this reference value. DECtalk’s hat rise setting very
roughly approximates this.

The  vocal  affect  timing  parameters  contribute  to  speech 
rhythm.  Such  correlates  arise  in  emotional  speech  from 
physiological  changes  in  respiration  rate  (changes  in 
breathing  patterns)  and  level  of  arousal. 

• Speech Rate: Controls the rate of words or syllables uttered
per minute. It influences how quickly an individual word or
syllable is uttered, the duration of sound to silence within
an utterance, and the relative duration of phoneme classes.
Speech is faster with higher arousal and slower with lower
arousal. This parameter varies directly with DECtalk’s speech
rate setting. It varies inversely with DECtalk’s period pause
and comma pause settings, as faster speech is accompanied by
shorter pauses.

• Stress Frequency: Controls the frequency of occurrence of
pitch accents and determines the smoothness or abruptness of
f0 transitions. As more words are stressed, the speech sounds
more emphatic and the speaker more agitated. It filters other
vocal affect parameters such as precision of articulation and
accent shape, and thereby contributes to the associated
DECtalk settings.

Emotion  can  induce  not  only  changes  in  pitch  and  tempo, 
but  in  voice  quality  as  well.  These  phenomena  primarily 
arise  from  changes  in  the  larynx  and  articulatory  tract. 
The  voice  quality  parameters  are  as  follows: 


DECtalk synthesizer setting   unit   neutral   min    max
---------------------------   ----   -------   ----   ----
average pitch                 Hz     306       260    350
assertiveness                 %      65        0      100
baseline fall                 Hz     0         0      40
breathiness                   dB     47        40     55
comma pause                   msec   160       -20    800
gain of frication             dB     72        60     80
gain of aspiration            dB     70        0      75
gain of voicing               dB     55        65     68
hat rise                      Hz     20        0      80
laryngealization              %      0         0      10
loudness                      dB     65        60     70
lax breathiness               %      75        100    0
period pause                  msec   640       -275   800
pitch range                   %      210       50     250
quickness                     %      50        0      100
speech rate                   wpm    180       75     300
richness                      %      40        0      100
smoothness                    %      5         0      100
stress rise                   Hz     22        0      80

Fig.  4.  Default  DECtalk  synthesizer  settings  for  Kismet’s  voice  that 
are  used  in  the  equations  for  altering  these  values  to  produce  Kismet’s 
expressive  speech. 


• Breathiness: Controls the aspiration noise in the speech
signal. It adds a tentative and weak quality to the voice when
the speaker is minimally excited. DECtalk’s breathiness and
lax breathiness settings vary directly with this.

• Brilliance: Controls the perceptual effect of relative
energies of the high and low frequencies. When the speaker is
agitated, higher frequencies predominate and the voice is
harsh or “brilliant”. When the speaker is relaxed or
depressed, lower frequencies dominate and the voice sounds
soothing and warm. DECtalk’s richness setting varies directly
as it enhances the lower frequencies. In contrast, DECtalk’s
smoothness setting varies inversely since it attenuates higher
frequencies.

• Laryngealization: Controls the perceived creaky voice
phenomenon. It arises from minimal sub-glottal pressure and a
small open quotient such that f0 is low, the glottal pulse is
narrow, and the fundamental period is irregular. It varies
directly with DECtalk’s laryngealization setting.

• Loudness: Controls the amplitude of the speech waveform. As
a speaker becomes aroused, the sub-glottal pressure builds,
which increases the signal amplitude. As a result, the voice
sounds louder. It varies directly with DECtalk’s loudness
setting. It also influences DECtalk’s gain of voicing.

• Pause Discontinuity: Controls the smoothness of f0
transitions from sound to silence for unfilled pauses. Longer
or more abrupt silences correlate with being more emotionally
upset. It varies directly with DECtalk’s quickness setting.

• Pitch Discontinuity: Controls smoothness or abruptness of f0
transitions, and the degree to which the intended targets are
reached. With more speaker control, the transitions are
smoother. With less control, the transitions are more abrupt.
It contributes to DECtalk’s stress rise and quickness
settings.

The  autonomic  nervous  system  modulates  articulation 
by  inducing  an  assortment  of  physiological  changes  such  as 
causing  dryness  of  mouth  or  increased  salivation.  There  is 
only  one  articulation  parameter  as  follows: 

• Precision: Controls a range of articulation from enunciation
to slurring. Slurring has minimal frication noise, whereas
greater enunciation for consonants results in increased
frication. Stronger enunciation also results in an increase in
aspiration noise and voicing. The precision of articulation
varies directly with DECtalk’s gain of frication, gain of
voicing, and gain of aspiration.


Fig.  5.  The  mapping  from  each  expressive  quality  of  speech  to  the 
vocal  affect  parameters  (VAPs).  There  is  a  single  fixed  mapping  for 
each  emotional  quality. 


B.  Mapping  VAPs  to  Synthesizer  Settings 

This section presents the equations that map the vocal affect
parameters to synthesizer setting values. Linear changes in
these vocal affect parameter values result in a non-linear
change in the underlying synthesizer settings. Furthermore,
the mapping between parameters and synthesizer settings is not
necessarily one-to-one. Each parameter affects a percentage of
the final synthesizer setting’s value (figure 3). When a
synthesizer setting is modulated by more than one parameter,
its final value is the sum of the effects of the controlling
parameters. The total of the absolute values of these
percentages must be 100%. See figure 4 for the allowable
bounds of synthesizer settings. The vocal affect parameters
can assume integer values from -10 to 10. Negative numbers
correspond to lesser effects, positive numbers correspond to
greater effects, and zero is the neutral setting. These values
are set according to the currently specified emotion as shown
in figure 5. The computational mapping occurs in three stages.

In the first stage, the percentage of each of the VAPs
($VAP_i$) relative to its total range, $PP_i$, is computed.
This is given by the equation:

$$PP_i = \frac{VAP_{\mathrm{value},i} + VAP_{\mathrm{offset}}}{VAP_{\mathrm{max}} - VAP_{\mathrm{min}}}$$

$VAP_i$ is the current VAP under consideration,
$VAP_{\mathrm{value},i}$ is its value specified by the current
emotion, $VAP_{\mathrm{offset}} = 10$ adjusts these values to
be positive, $VAP_{\mathrm{max}} = 10$, and
$VAP_{\mathrm{min}} = -10$.
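
As a minimal sketch of this first stage, the normalization can be written in Python as follows. The function and constant names are our own illustrative choices and are not taken from Kismet's implementation; the sketch assumes integer VAP values between -10 and 10 inclusive.

```python
# Illustrative sketch only: names are hypothetical, not Kismet's actual code.
VAP_MIN = -10     # lowest allowed vocal affect parameter value
VAP_MAX = 10      # highest allowed vocal affect parameter value
VAP_OFFSET = 10   # shifts values so that they are non-negative


def vap_percentage(vap_value: int) -> float:
    """Return PP_i, the position of a VAP value within its range (0 to 1)."""
    if not VAP_MIN <= vap_value <= VAP_MAX:
        raise ValueError(f"VAP value {vap_value} outside [{VAP_MIN}, {VAP_MAX}]")
    return (vap_value + VAP_OFFSET) / (VAP_MAX - VAP_MIN)
```

Under this reading, the neutral value 0 maps to 0.5, while the extremes -10 and 10 map to 0 and 1.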

In the second stage, a weighted contribution ($WC_{j,i}$) of
those $VAP_i$ that control each of DECtalk’s synthesizer
settings ($SS_j$) is computed. The far right column of
figure 3 specifies each of the corresponding scale factors
($SF_{j,i}$). Each scale factor represents a percentage of
control that each $VAP_i$ applies to its synthesizer setting
$SS_j$.

For each synthesizer setting, $SS_j$:

    For each corresponding scale factor, $SF_{j,i}$ of $VAP_i$:

        If $SF_{j,i} > 0$:
            $WC_{j,i} = PP_i \times SF_{j,i}$
        If $SF_{j,i} < 0$:
            $WC_{j,i} = (1 - PP_i) \times (-SF_{j,i})$

    $SS_j = \sum_i WC_{j,i}$
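
The second stage can be sketched in Python as below. This is only an illustration of the rule as stated: the function name, the dictionary layout, and the example scale factors are assumptions of ours (the actual scale factors are those of figure 3, which are not reproduced here).

```python
def setting_activation(pp, scale_factors):
    """Sum the weighted contributions WC_{j,i} for one synthesizer setting.

    pp maps a VAP name to its PP_i value in [0, 1].
    scale_factors maps a VAP name to its signed scale factor SF_{j,i}.
    """
    total = 0.0
    for vap_name, sf in scale_factors.items():
        if sf > 0:
            # Direct relationship: a larger PP_i raises the setting.
            total += pp[vap_name] * sf
        elif sf < 0:
            # Inverse relationship: a larger PP_i lowers the setting.
            total += (1.0 - pp[vap_name]) * (-sf)
    return total


# Hypothetical scale factors for one setting; absolute values sum to 1.
example_sf = {"final_lowering": 0.8, "contour_slope": -0.2}
example_pp = {"final_lowering": 0.75, "contour_slope": 0.25}
activation = setting_activation(example_pp, example_sf)
```

Because the absolute values of the scale factors sum to 1 and every PP_i lies between 0 and 1, the result also lies between 0 and 1, and all-neutral parameters (every PP_i equal to 0.5) give 0.5.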


At this point, each synthesizer setting has a value
$0 \le SS_j \le 1$. In the final stage, each synthesizer
setting $SS_j$ is scaled about 0.5. This produces the final
synthesizer value, $SS_{j,\mathrm{final}}$. The final value is
sent to the speech synthesizer. The maximum, minimum, and
default values of the synthesizer settings are shown in
figure 4.


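For each final synthesizer setting, $SS_{j,\mathrm{final}}$:

    Compute $SS_{j,\mathrm{offset}} = SS_j - \mathit{norm}$

    If $SS_{j,\mathrm{offset}} > 0$:
        $SS_{j,\mathrm{final}} = SS_{j,\mathrm{default}} + (2 \times SS_{j,\mathrm{offset}} \times (SS_{j,\mathrm{max}} - SS_{j,\mathrm{default}}))$
    If $SS_{j,\mathrm{offset}} < 0$:
        $SS_{j,\mathrm{final}} = SS_{j,\mathrm{default}} + (2 \times SS_{j,\mathrm{offset}} \times (SS_{j,\mathrm{default}} - SS_{j,\mathrm{min}}))$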
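
The final scaling stage might be sketched as follows. Two points are assumptions on our part: that the normalizing constant equals 0.5 (consistent with the statement that each setting is scaled about 0.5), and that positive offsets interpolate between the default and the maximum while negative offsets interpolate between the default and the minimum. The function name is hypothetical; the example bounds are the speech rate values listed in figure 4.

```python
def final_setting(ss, default, minimum, maximum, norm=0.5):
    """Map a normalized setting value ss in [0, 1] onto synthesizer units.

    ss == norm yields the default; ss == 1 yields the maximum;
    ss == 0 yields the minimum (assuming norm == 0.5).
    """
    offset = ss - norm
    if offset > 0:
        return default + 2 * offset * (maximum - default)
    if offset < 0:
        return default + 2 * offset * (default - minimum)
    return default


# Speech rate bounds from figure 4: default 180 wpm, min 75 wpm, max 300 wpm.
fast = final_setting(1.0, default=180, minimum=75, maximum=300)
slow = final_setting(0.0, default=180, minimum=75, maximum=300)
neutral = final_setting(0.5, default=180, minimum=75, maximum=300)
```

Scaling about the midpoint in this way lets the default sit anywhere between the bounds, so that an all-neutral set of parameters reproduces Kismet's default voice.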


IV.  Kismet’s  Expressive  Utterances 

Given a string to be spoken and the updated synthesizer
settings, Kismet can vocally express itself with different
emotional qualities (anger, disgust, fear, joy, sorrow, or
surprise). To evaluate Kismet’s speech, we analyzed the
produced utterances with respect to the acoustical correlates
of emotion. This reveals whether the implementation produces
similar acoustical changes to the speech waveform given a
specified emotional state. We also evaluated how the affective
modulations of the synthesized speech are perceived by human
listeners.

A. Analysis of Speech

To analyze the performance of the expressive vocalization
system, we extracted the dominant acoustic features that are
highly correlated with emotive state. The acoustic features
and their modulation with emotion are summarized in figure 1.
Specifically, these are average pitch, pitch range, pitch
variance, and mean energy. To measure speech rate, we
extracted the overall time to speak and the total time of
voiced segments.
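
As a rough illustration of how such features could be summarized once a pitch track and an energy track are available, consider the Python sketch below. It is not the analysis tool used in this work: the function name, the frame-based input format (one f0 estimate and one energy value per frame, with 0 marking unvoiced frames), and the 10 ms frame step are all assumptions made for the example.

```python
from statistics import mean, pvariance


def acoustic_features(f0_hz, energy, frame_s=0.01):
    """Summarize a frame-level pitch track and energy track.

    f0_hz holds one pitch estimate per frame, with 0 marking unvoiced frames.
    energy holds one energy value per frame.
    frame_s is the (assumed) frame step in seconds.
    """
    if len(f0_hz) != len(energy):
        raise ValueError("f0_hz and energy must have the same number of frames")
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames found")
    return {
        "average_pitch": mean(voiced),
        "pitch_range": max(voiced) - min(voiced),
        "pitch_variance": pvariance(voiced),
        "mean_energy": mean(energy),
        "total_time": len(f0_hz) * frame_s,
        "voiced_time": len(voiced) * frame_s,
    }
```

Pitch statistics are computed over voiced frames only, since f0 is undefined during silence and unvoiced sounds, while the two durations give a simple proxy for speech rate.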
Fig. 6. Table of acoustic features for the three utterances.

Features  were  extracted  from  three  phrases: 

•  Look  at  that  picture 

•  Go  to  the  city 

•  It's  been  moved  already 

The results are summarized in figure 6. The values for each
feature are displayed for each phrase with each emotive
quality (including the neutral state). The averages are also
presented in the table and plotted in figure 7. These plots
illustrate how each emotive quality modulates these acoustic
features with respect to one another. The pitch contours for
each emotive quality are shown in figure 8. They correspond to
the utterance “It’s been moved already.” Relating these plots
with figure 1, it is clear that many of the acoustic
correlates of emotive speech are preserved in Kismet’s speech.

Kismet’s  vocal  quality  varies  with  its  emotive  state  as 
follows: 

• Fearful speech is very fast with wide pitch contour, large
pitch variance, very high mean pitch, and normal intensity. We
have added a slightly breathy quality to the voice as people
seem to associate it with a sense of trepidation.

• Angry speech is loud and slightly fast with a wide pitch
range and high variance. We have purposefully implemented a
low mean pitch to give the voice a prohibiting quality. This
differs from figure 1, but a preliminary study demonstrated a
dramatic improvement in recognition performance of naive
subjects. This makes sense as it gives the voice a threatening
quality.

• Sad speech has a slower speech rate, with longer pauses than
normal. It has a low mean pitch, a narrow pitch range and low
variance. It is softly spoken with a slightly breathy quality.
This differs from figure 1, but it gives the voice a tired
quality. It has a pitch contour that falls at the end.

Fig. 7. Plots of acoustic features of Kismet’s speech. Plots illustrate
how each emotion relates to the others for each acoustic feature. The
horizontal axis simply maps an integer value to each emotion for ease
of viewing (anger=1, calm=2, etc.)

•  Happy  speech  is  relatively  fast,  with  a  high  mean  pitch, 
wide  pitch  range,  and  wide  pitch  variance.  It  is  loud  with 
smooth  undulating  inflections  as  shown  in  figure  8. 

•  Disgusted  speech  is  slow  with  long  pauses  interspersed.  It 
has  a  low  mean  pitch  with  a  slightly  wide  pitch  range.  It  is 
fairly  quiet  with  a  slight  creaky  quality  to  the  voice.  The 
contour  has  a  global  downward  slope  as  shown  in  figure  8. 

•  Surprised  speech  is  fast  with  a  high  mean  pitch  and  wide 
pitch  range.  It  is  fairly  loud  with  a  steep  rising  contour  on 
the  stressed  syllable  of  the  final  word. 

B. Human Listener Experiments

To  evaluate  Kismet’s  expressive  speech,  nine  subjects 
were  asked  to  listen  to  prerecorded  utterances  and  to  fill 
out  a  forced-choice  questionnaire.  Subjects  ranged  from  23 
to  54  years  of  age,  all  affiliated  with  MIT.  The  subjects  had 
very  limited  to  no  familiarity  with  Kismet’s  voice. 

In this study, each subject first listened to an introduction
spoken with Kismet’s neutral expression. This was to acquaint
the subject with Kismet’s synthesized quality of voice and
neutral affect. A series of eighteen utterances followed,
covering six expressive qualities (anger, fear, disgust,
happiness, surprise, and sorrow). Within the experiment, the
emotive qualities were distributed randomly. Given the small
number of subjects per study, we only used a single
presentation order per experiment. Each subject could work at
his/her own pace and control the number of presentations of
each stimulus.

The three stimulus phrases were: “I’m going to the city,” “I
saw your name in the paper,” and “It’s happening tomorrow.”
The first two test phrases were selected because Cahn (1990)
had found the word choice to have reasonably neutral affect.
In a previous version of the study, subjects reported that it
was just as easy to map emotional correlates onto English
phrases as to Kismet’s non-linguistic vocalizations (akin to
infant-like babbles). Their performance for English phrases
and Kismet’s babbles supports this.

Fig. 8. Pitch analysis of Kismet’s speech for the English phrase “It’s
been moved already.”

Using a forced choice paradigm, the subjects were simply asked
to circle the word which best described the voice quality. The
choices were “anger,” “disgust,” “fear/panic,” “happy,” “sad,”
“surprise/excited.” From a previous iteration of the study, we
found that word choice mattered. A given emotion category can
have a wide range of vocal affects. For instance, the subject
could interpret “fear” to imply “apprehensive,” which might be
associated with Kismet’s whispery vocal expression for
sadness. Alternatively, it could be associated with “panic”
which is a more aroused interpretation. The results from these
evaluations are summarized in figure 9.

Overall, the subjects exhibited reasonable performance in
correctly mapping Kismet’s expressive quality with the
targeted emotion. However, the expression of “fear” proved
somewhat problematic. For all other expressive qualities, the
performance was significantly above chance. Furthermore,
misclassifications were highly correlated to similar emotions.
For instance, “anger” was sometimes confused with “disgust”
(sharing negative valence) or “surprise/excitement” (both
sharing high arousal). “Disgust” was confused with other
negative emotions. “Fear” was confused with other high arousal
emotions (with “surprise/excitement” in particular). The
distribution for “happy” was more spread out, but it was most
often confused with “surprise/excitement,” with which it
shares high arousal. Kismet’s “sad” speech was confused with
other negative emotions. The distribution for
“surprise/excitement” was broad, but it was most often
confused with “fear.”
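
A confusion analysis of this kind can be tabulated with a few lines of Python. The sketch below is purely illustrative: the function name and data layout are our own, any trial data passed to it in an example would be invented and is not the study data reported in figure 9, and chance level is taken to be one in six because the questionnaire offers six choices.

```python
from collections import Counter, defaultdict

# The six forced-choice labels offered to subjects.
LABELS = ["anger", "disgust", "fear/panic", "happy", "sad", "surprise/excited"]

# With six equally likely choices, random guessing is correct 1/6 of the time.
CHANCE = 1 / len(LABELS)


def confusion_matrix(trials):
    """Tabulate forced-choice responses.

    trials is an iterable of (intended, chosen) label pairs.
    Returns {intended: {chosen: fraction of that intended emotion's trials}}.
    """
    counts = defaultdict(Counter)
    for intended, chosen in trials:
        counts[intended][chosen] += 1
    matrix = {}
    for intended, row in counts.items():
        total = sum(row.values())
        matrix[intended] = {label: row[label] / total for label in LABELS}
    return matrix
```

Each row of the resulting matrix sums to 1, so the diagonal entry can be compared directly against `CHANCE`, and the largest off-diagonal entry identifies the emotion most often confused with the intended one.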


V.  Summary 

For the purposes of evaluation, the current set of data is
promising. Misclassifications are particularly informative.
The mistakes are highly correlated with similar emotions,
which suggests that arousal and valence are conveyed to people
(arousal being more consistently conveyed than valence). We
are using the results of this study to improve Kismet’s
expressive qualities. In addition, Kismet expresses itself
through multiple modalities, not just through voice. We
believe that Kismet’s facial expression and body posture
should help resolve the ambiguities encountered through voice
alone.

Fig. 9. Naive subjects assessed the emotion conveyed in Kismet’s
voice in a forced-choice evaluation. All emotional qualities were
recognized with reasonable performance except for “fear” which was most
often confused with “surprise/excitement.” Both expressive qualities
share high arousal, so the confusion is not unexpected.